Assignment of structural domains in proteins using diffusion kernels on graphs

Though proposing algorithmic approaches for protein domain decomposition has been of high interest, the inherent ambiguity to the problem makes it still an active area of research. Besides, accurate automated methods are in high demand as the number of solved structures for complex proteins is on the rise. While majority of the previous efforts for decomposition of 3D structures are centered on the developing clustering algorithms, employing enhanced measures of proximity between the amino acids has remained rather uncharted. If there exists a kernel function that in its reproducing kernel Hilbert space, structural domains of proteins become well separated, then protein structures can be parsed into domains without the need to use a complex clustering algorithm. Inspired by this idea, we developed a protein domain decomposition method based on diffusion kernels on protein graphs. We examined all combinations of four graph node kernels and two clustering algorithms to investigate their capability to decompose protein structures. The proposed method is tested on five of the most commonly used benchmark datasets for protein domain assignment plus a comprehensive non-redundant dataset. The results show a competitive performance of the method utilizing one of the diffusion kernels compared to four of the best automatic methods. Our method is also able to offer alternative partitionings for the same structure which is in line with the subjective definition of protein domain. With a competitive accuracy and balanced performance for the simple and complex structures despite relying on a relatively naive criterion to choose optimal decomposition, the proposed method revealed that diffusion kernels on graphs in particular, and kernel functions in general are promising measures to facilitate parsing proteins into domains and performing different structural analysis on proteins. The size and interconnectedness of the protein graphs make them promising targets for diffusion kernels as measures of affinity between amino acids. The versatility of our method allows the implementation of future kernels with higher performance. The source code of the proposed method is accessible at https://github.com/taherimo/kludo. Also, the proposed method is available as a web application from https://cbph.ir/tools/kludo. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04902-9.


Training data
In this study we used a set of training data for two purposes: (1) to train the single/multi-domain classifier, and (2) to tune bandwidth parameter of the diffusion kernels. Subtracting ASTRAL40 and five benchmark datasets (Benchmark_1, Benchmark_2, Benchmark_3, Islam and Jones) that were used for evaluation from ASTRAL95 left 13350 protein chains to use for the purpose of training and parameter tuning.

Single/multi-domain classifier
An ensemble of classifiers is used to categorize the input protein structure into either "single" or "multidomain" classes. More precisely it is a bagging (bootstrap aggregating) classifier where each of its estimators (weak classifiers) is trained on a balanced bootstrap sample of the training set. The final output of the classifier is calculated by averaging over the outputs of all decision trees. In this section we describe a detailed explanation about the single/multi-classifier in four sub-sections: feature extraction, hyperparameter optimization, classifier performance assessment, and feature importance analysis on the classifier.

Training data labeling
The training set contains the protein chains that there is no an agreement between SCOP and CATH in categorizing them as single-or multi-domain, or their domain decomposition is not available in SCOP and/or CATH. Thus, we used the convention depicted in Table S1 to label such protein chains. The category column assigns a symbol to each case as SS: categorized as single-domain by both SCOP and CATH, MM: categorized as multi-domain by both SCOP and CATH, SM: categorized as single-domain by one of the two databases and multi-domain by the other, S!: categorized as single-domain in one of the two databases and missing in the other, M!: categorized as multi-domain in one of the two databases and missing in the other, !!: missing in both SCOP and CATH. According to the table, we ignored two groups S! and !! to train the single/multi-domain classifier. It led to 6862 protein chains labeled as S (singledomain) and 4684 protein chains labeled as M (multi-domain). Total abundance 13350

Feature extraction
Here we present a description about each of the features and the way they are calculated. According to Figure 2, three overlapping sets are features are extracted: (1) biological features, (2) The complete list of features along with their description is presented in the Table S1.

Radius of gyration
The radius of gyration of a protein structure is an indicator of protein compactness. It is calculated as the root mean square deviation of the atoms' distance from the structure's center of mass. In addition to passing it as an input feature to the single/multi-domain classifier, we use radius of gyration to calculate the bandwidth parameter of the kernel functions.

Network centrality measures
To extract 42 out of the 78 features from a protein structure, calculation of three network centrality measures for each node of the protein graph is required: degree, closeness and betweenness. Degree of a node is defined as the number of links incident upon that node. If a network is weighted, then the weighted degree of a node can be computed by summing up the weights of links connected to that node. Closeness [1] of a node is the average length of the shortest paths to all the other nodes in the graph. Betweenness [2] quantifies the number of times a node appears in the shortest path between any pair of nodes. Since the network is weighted, each centrality measure is calculated in two weighted and unweighted (without considering network weighs) cases.

Weighted variances
In the cases that we calculate weighted variance of alpha-carbon coordinates along a principal component, we use relative hydrophobicity, reversed relative surface accessibility and three (weighted and unweighted) centrality measures described in the paper as weights for the amino acids. In general, a weighted variance is calculated as: where is the weight (importance) of th residue. In this study we took into account the following values as weights: relative hydrophobicity, reverse of the relative surface accessibility, (weighted/unweighted) node degree, (weighted/unweighted) node closeness and (weighted/unweighted) node betweenness.

Hopkins statistic
Hopkins statistic [3] shows tendency of a set of (possibly multi-dimensional) points to be clustered. To calculate Hopkins statistic, given a set of points (in our case alpha-carbon coordinates) a random sample of size ≪ is selected. Then a set of uniformly distributed points is generated. For each point in , two distance measures are defined: is the distance of (1 ≤ ≤ ) in from its nearest neighbor in , and the distance of a randomly picked point (1 ≤ ≤ ) in from its nearest neighbor in . Hopkins statistic is computed as follows: Due to randomness of the Hopkins statistic we repeat the test for 100 times and consider the average of 100 scores as the input feature of the classifier.

Dip statistic
Distinguishing whether a set of points (in our case alpha-carbons) possess a unimodal or multimodal distribution can help us to assess clustering tendency of the points. Hartigan's dip statistic [4] was proposed to measure unimodality of a set of points in one-dimensional space. More precisely, dip statistic shows deviation of a given distribution from a reference (unimodal) distribution. So, the lower the dip statistic, the more likely it is that the given distribution is unimodal. We considered dip statistic along each principal component of alpha-carbon coordinates as input features for the single/multi-domain classifier.
To construct a one-dimensional distribution over each principal component of alpha-carbon coordinates, a normalized histogram with the bin size of 4 Å is calculated.

Clustering coefficient
A triplet in an undirected graph is defined as three nodes that are connected by two (open triplet) or three (closed triplet) edges. So the global clustering coefficient [5] is calculated as: Also, the clustering coefficient for the th node (local clustering coefficient) [6] in a graph is calculated as: where for the node the neighborhood is defined as its directly connected neighbor nodes, and is defined as the number of nodes in (i.e. | |). The average clustering coefficient is then measured as:

Eigenvalues of the Laplacian matrix
One of the ways to assess clustering tendency of a graph, is to measure its connectivity strength. Strongly connected graphs have a less tendency to be clustered into subgraphs. Smallest eigenvalues of the Laplacian matrix of a graph are appropriate measures to represent graph connectivity. Since the smallest eigenvalue of a Laplacian matrix is always zero we ignored it. The magnitude of the second-smallest eigenvalue of the Laplacian matrix (known as Fiedler value) of a graph reflects how well the graph can be partitioned into two disjoint subgraphs (how well connected the graph is). In general, the -smallest eigenvectors of the Laplacian matrix can be used to find a useful -way partitioning [7]. Thus, we use the first 10 non-zero smallest eigenvalues (2 nd to 11 th eigenvalues) of the normalized Laplacian matrix as input features for the single/multi-domain classifier.

Hyper-parameter optimization
The single/multi-domain classifier is an ensemble of base classifiers each trained on a balanced bootstrap sample. Thus, the classifier poses four main hyper-parameters: (1) base classifier type, (2) bootstrap size, (3) the number of base classifiers, and (4) the resampling procedure performed on each bootstrap. To determine the first hyper-parameter, an 80%-20% train-test split evaluation was performed and as a result, decision tree was chosen as the base estimator type from a set of candidate models (the others were MLP neural network, SVM with RBF kernel, naïve Bayes). Also, the bootstrap size (the second hyperparameter) was set equal to the size of the training set (6862 + 4684 = 11546) to cover all data as much as possible. In order to set the two other hyper-parameters, namely, the number of decision trees and the resampling procedure (to balance bootstrap samples), a grid search over the training set using 5-fold crossvalidation was performed. As a result, the combination of 190 decision trees and SMOTE resampling algorithm [8,9] was selected as the best choice. The search space for the grid search consisted of all combinations of {50,60, … ,200} as the number of decision trees and the three resampling procedures: SMOTE, SMOTETomek and TomekLinks..

Classifier performance
Although the single/multi-domain classifier was not the main focus of this study, but it exhibited a significant effect on the performance of the overall pipeline. Figure S2 shows the performance of the classifier based on the three measures: accuracy, Matthews correlation coefficient (MCC) [10] and area under ROC curve (AUC). To calculate the performance measures in the test phase we used a different approach with that of the training phase. If a protein is predicted as single-domain and its main label (according to Table S1) is S or SM, we considered the prediction as true. As the same way, if a protein is predicted as multi-class and its main label is M, M! or SM, we considered the prediction as true. By comparing Figure S2 with overall performance of the method (Table 2 and Figure 9 in the main paper), one can notice that most errors of KluDo are due to misclassification of the proteins into single-or multidomain classes. Also, Figure S3 and Figure S4 show recall, precision and F-measure (F1) for each of the single-domain and multi-domain classes over the test sets. Also, Figure S5 shows ROC curve for each of the datasets by considering multi-domain proteins as positive class.
Since, decision tree was used as the base classifier of the bagging classifier, its interpretability allows us to obtain the degree of importance for each of the features. Figure S6 shows the list of 78 input features of the classifier, sorted by the degree of importance and the degree of importance itself. As seen in the figure, the features that are related to the first principal component of alpha-carbon coordinates are of higher importance in the classification of protein structures into single-or multi-domain classes. Also, the second eigenvalue of the Laplacian matrix that is known as Fiedler value has gained a high degree of importance. It is also true about dip statistic.

Bandwidth determination
In this study the bandwidth value for each kernel function is calculated as follows: where is the radius of gyration of the protein structure and is a coefficient that is determined for each kernel. We inspected the performance of all combinations of kernels and clustering methods by running KluDo on multi-domain structures of the training set (MM category in Table S1) over predefined sequences of values for , while assuming all structures as multi-domain (single/multi-domain classifier was not used). For each kernel a distinct range of values were used for . Figure S7 shows the accuracies based on OL score with the threshold of 85% for different values of . The accuracies are calculated after removing protein chains with fixed OL score across all values of .
Since in most cases there is no a clear peak point in the curves, we decided to choose three values of with highest accuracies instead of choosing a single value. The selected values for are indicated in Figure  S6 with the dotted lines. Table S3 shows the accuracy of KluDo over the multi-domain structures of ASTRAL40 (based on MM category condition) for each of the three values of when using all kernel function-clustering method pairs. For each case two accuracies based on OL and ARI (both with the threshold of 85%) scores are reported. Highlighted cells indicate the best cases on which results are reported in the main text.  w_var_w_cls_pc1  w_var_acc_pc1  cls_var  w_var_deg_pc1  acc_mean  lp_eigval2  w_btw_var  btw_var  hydphob_mean  dip_stat_pc1  deg_var  lp_eigval3  deg_hydphob_corr  clust_coef_al_w  e  clust_coef_g  cls_acc_corr  hydphob_var  cls_hydphob_corr  btw_acc_corr  e_n_ratio  n  deg_mean  lp_eigval4  dip_stat_pc3  w_deg_hydphob_corr  w_var_btw_pc1  w_deg_hydphob_corr  w_var_deg_pc3  btw_hydphob_corr  w_deg_var  dip_stat_pc2  deg_acc_corr  w_cls_acc_corr  w_var_hydphob_pc1  w_cls_var  w_btw_acc_corr  clust_coef_al  acc_var  w_btw_hydphob_corr  hopkins_stat  w_var_w_btw_pc3  w_var_btw_pc3  w_var_acc_pc3  radius_of_gyration  w_e  w_var_hydphob_pc2  w_cls_hydphob_corr  w_e_n_ratio  w_var_w_btw_pc1  w_btw_mean  w_var_btw_pc2  w_deg_mean  w_cls_mean  lp_eigval5  lp_eigval9  cls_mean  w_var_w_cls_pc3  btw_mean  w_var_cls_pc1  w_var_w_deg_pc1  lp_eigval6  lp_eigval7  expl_var_pc1  lp_eigval11  w_var_w_btw_pc2  w_var_acc_pc2  w_var_hydphob_pc3  w_var_deg_pc2  expl_var_pc2  w_var_cls_pc3  w_var_w_deg_pc2  lp_eigval10  lp_eigval8  w_var_w_deg_pc3  w_var_w_cls_pc2 w_var_cls_pc2 expl_var_pc3 kernel k-means spectral clustering

Randomized graph tests
In order to assess the suitability of graph construction procedure we performed a set of randomization tests. For this purpose, the multi-domain chains (based on MM category definition in Table S1) of the three datasets benchmark_1, benchmark_2 and benchmark_3 were chosen. Two different randomization methods were used. In the first test, for each protein, random weights were assigned to the edges of the original graph topology based on a uniform weight distribution whose minimum and maximum values are obtained from the original protein graph constructed by the standard procedure of KluDo. In the second test, both topology and weights were generated randomly. For each protein, graph topology was generated by random rewiring of the original protein graph such that the degree distribution is preserved, and the weights were assigned randomly by the procedure used in the previous test. All decompositions were done using LED as the kernel function and spectral clustering as the clustering method. Also, it was assumed that all the structures consist of at least two domains since only multi-domain proteins were selected for the analysis.
The mean accuracy across 30 runs for each of the tests was calculated. While the mean accuracy from the first test (randomized weights) showed a significant decrease over Benchmark_2 and Benchmark_3, the mean accuracy from the second test (randomized topology and weights) were about to zero over all three datasets. From these results one can conclude that the graph topology contains much more information about the protein structures than edge weights. Figure S8 shows the accuracy of the standard KluDo procedure along with the mean accuracy resulted from the first test (randomized weights) over the multidomain proteins in Benchmark_1, Bebchmark_2 and Benchmark_3. Figure S8: Accuracy of the standard KluDo procedure versus the mean accuracy resulted from weight randomization over the multi-domain proteins in Benchmark_1, Bebchmark_2 and Benchmark_3.

Implementation
The whole project was written in Python 3.7. The Python packages atomium [11] (parsing PDB files), igraph [12] (graph procedures), pyclustertend [13] (Hopkins statistic), unidip [14] (dip statistic), scikitlearn [15] (spectral clustering, atomic contact calculation, single/multi-domain classifier evaluation), imbalanced-learn [16] (scikit-learn wrapper, single/multi-domain classification), tslearn [17] (kernel kmeans clustering) and the command-line program DSSP [18,19] were used in this project. KluDo can be executed as a Windows/Linux/Mac command-line application. We also developed a web application using the Django framework [20] to make KluDo available through the world wide web, so that users can enter a protein chain as well as algorithm-related arguments. Once the algorithm is executed, the result is represented in both text form and 3D structure powered by Jmol [21].
All the parameters in Table 1 besides kernel function, clustering method and lower/upper bound for the number of domains can be set optionally by the users in both command-line and web application. In addition to the algorithm-related arguments, we made it possible for the users to enter a range for the number of domains as an input argument. The source code of this project is available on Github at https://github.com/taherimo/kludo. Also, the web application is accessible from http://cbph.ir/tools/kludo.